    Neighbor-Dependent Ramachandran Probability Distributions of Amino Acids Developed from a Hierarchical Dirichlet Process Model

    Distributions of the backbone dihedral angles of proteins have been studied for over 40 years. While many statistical analyses have been presented, only a handful of probability densities are publicly available for use in structure validation and structure prediction methods. The available distributions differ in a number of important ways, which determine their usefulness for various purposes. These include: 1) input data size and criteria for structure inclusion (resolution, R-factor, etc.); 2) filtering of suspect conformations and outliers using B-factors or other features; 3) secondary structure of input data (e.g., whether helix and sheet are included; whether beta turns are included); 4) the method used for determining probability densities ranging from simple histograms to modern nonparametric density estimation; and 5) whether they include nearest neighbor effects on the distribution of conformations in different regions of the Ramachandran map. In this work, Ramachandran probability distributions are presented for residues in protein loops from a high-resolution data set with filtering based on calculated electron densities. Distributions for all 20 amino acids (with cis and trans proline treated separately) have been determined, as well as 420 left-neighbor and 420 right-neighbor dependent distributions. The neighbor-independent and neighbor-dependent probability densities have been accurately estimated using Bayesian nonparametric statistical analysis based on the Dirichlet process. In particular, we used hierarchical Dirichlet process priors, which allow sharing of information between densities for a particular residue type and different neighbor residue types. The resulting distributions are tested in a loop modeling benchmark with the program Rosetta, and are shown to improve protein loop conformation prediction significantly. The distributions are available at http://dunbrack.fccc.edu/hdp

    Phylogeny, phylogeography and hybridization of Caucasian barbels of the genus Barbus (Actinopterygii, Cyprinidae)

    Phylogenetic relationships and phylogeography of six species of Caucasian barbels, the genus Barbus s. str., were studied based on extended geographic coverage and using mtDNA and nDNA markers. Based on the 27 species studied, the matrilineal phylogeny of the genus Barbus comprises two clades: (a) a West European clade and (b) a Central and East European clade. The latter comprises two subclades: (b1) a Balkanian subclade and (b2) a Ponto-Caspian subclade that includes 11 lineages, mainly from Black and Caspian Sea drainages. Caucasian barbels are not monophyletic and are subdivided into two groups. The Black Sea group encompasses species from tributaries of the Black Sea, including the re-erected B. rionicus and excluding B. kubanicus. The Caspian group includes B. ciscaucasicus, B. cyri (with B. goktschaicus, which might be synonymized with B. cyri), B. lacerta from the Tigris-Euphrates basin, and B. kubanicus from the Kuban basin. The genetic structure of Black Sea barbels was influenced by glaciation-deglaciation periods accompanied by freshwater phases and by periods of migration and colonization of Black Sea tributaries. Intra- and intergeneric hybridization among Caucasian barbines was revealed. In the present study, we report the finding of B. tauricus in the Kuban basin, where only B. kubanicus was thought to occur. Hybrids between these species were detected based on both mtDNA and nDNA markers. Remarkably, the Kuban population of B. tauricus is genetically distant from geographically close conspecific populations, and we consider it a relict. We highlight the detection of intergeneric hybridization between the evolutionary tetraploid (2n=100) B. goktschaicus and the evolutionary hexaploid (2n=150) Capoeta sevangi in Lake Sevan. The study was supported by the Russian Science Foundation (grant no. 15-14-10020); the final stage of the study was supported by the Russian Foundation for Basic Research (grants nos. 18-54-05003 and 19-04-00719)

    A new clustering and nomenclature for beta turns derived from high-resolution protein structures.

    Protein loops connect regular secondary structures and contain 4-residue beta turns which represent 63% of the residues in loops. The commonly used classification of beta turns (Type I, I', II, II', VIa1, VIa2, VIb, and VIII) was developed in the 1970s and 1980s from analysis of a small number of proteins of average resolution, and represents only two thirds of beta turns observed in proteins (with a generic class Type IV representing the rest). We present a new clustering of beta-turn conformations from a set of 13,030 turns from 1074 ultra-high resolution protein structures (≤1.2 Å). Our clustering is derived from applying the DBSCAN and k-medoids algorithms to this data set with a metric commonly used in directional statistics applied to the set of dihedral angles from the second and third residues of each turn. We define 18 turn types compared to the 8 classical turn types in common use. We propose a new 2-letter nomenclature for all 18 beta-turn types using Ramachandran region names for the two central residues (e.g., 'A' and 'D' for alpha regions on the left side of the Ramachandran map and 'a' and 'd' for equivalent regions on the right-hand side; classical Type I turns are 'AD' turns and Type I' turns are 'ad'). We identify 11 new types of beta turn, 5 of which are sub-types of classical beta-turn types. Up-to-date statistics, probability densities of conformations, and sequence profiles of beta turns in loops were collected and analyzed. A library of turn types, BetaTurnLib18, and cross-platform software, BetaTurnTool18, which identifies turns in an input protein structure, are freely available and redistributable from dunbrack.fccc.edu/betaturn and github.com/sh-maxim/BetaTurn18. Given the ubiquitous nature of beta turns, this comprehensive study updates understanding of beta turns and should also provide useful tools for protein structure determination, refinement, and prediction programs
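The abstract above mentions clustering with "a metric commonly used in directional statistics" applied to dihedral angles, without giving the formula. As a rough illustration only, the sketch below assumes the per-angle distance 2(1 - cos Δθ), averaged over the angles being compared; this choice, and the function names `angular_distance` and `turn_distance`, are assumptions made here and are not taken from the listing.

```python
import math


def angular_distance(a_deg, b_deg):
    """Distance between two angles in degrees: 2 * (1 - cos(difference)).

    Assumed form of a directional-statistics metric. It is periodic, so
    179 and -179 degrees are treated as close, unlike a plain subtraction.
    """
    delta = math.radians(a_deg - b_deg)
    return 2.0 * (1.0 - math.cos(delta))


def turn_distance(turn_a, turn_b):
    """Average angular distance over corresponding dihedral angles.

    Each turn is a sequence of dihedral angles in degrees, for example the
    phi and psi angles of the second and third residues of a beta turn.
    """
    if len(turn_a) != len(turn_b) or not turn_a:
        raise ValueError("turns must be non-empty and of equal length")
    total = sum(angular_distance(a, b) for a, b in zip(turn_a, turn_b))
    return total / len(turn_a)


if __name__ == "__main__":
    # Hypothetical angle values, chosen only to exercise the functions.
    turn_1 = (-60.0, -30.0, -90.0, 0.0)
    turn_2 = (-65.0, -25.0, -95.0, 5.0)
    print(turn_distance(turn_1, turn_2))
```

A pairwise matrix built from such a function could be passed to any clustering routine that accepts precomputed distances; which settings the authors used for DBSCAN and k-medoids is not stated in the abstract.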

    Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction.

    Protein secondary structure prediction remains a vital topic with broad applications. Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also lacking. In this paper, we present: (1) new test sets, Test2018, Test2019, and Test2018-2019, consisting of proteins from structures released in 2018 and 2019 with less than 25% identity to any protein published before 2018; (2) a 4-layer convolutional neural network, SecNet, with an input window of ±14 amino acids which was trained on proteins ≤25% identical to proteins in Test2018 and the commonly used CB513 test set; (3) an additional test set that shares no homologous domains with the training set proteins, according to the Evolutionary Classification of Proteins (ECOD) database; (4) a detailed ablation study where we reverse one algorithmic choice at a time in SecNet and evaluate the effect on the prediction accuracy; (5) new 4- and 5-label prediction alphabets that may be more practical for tertiary structure prediction methods. The 3-label accuracy (helix, sheet, coil) of the leading predictors on both Test2018 and CB513 is 81-82%, while SecNet's accuracy is 84% for both sets. Accuracy on the non-homologous ECOD set is only 0.6 points (83.9%) lower than the results on the Test2018-2019 set (84.5%). The ablation study of features, neural network architecture, and training hyper-parameters suggests the best accuracy results are achieved with good choices for each of them while the neural network architecture is not as critical as long as it is not too simple. Protocols for generating and using unbiased test, validation, and training sets are provided. Our data sets, including input features and assigned labels, and SecNet software including third-party dependencies and databases, are downloadable from dunbrack.fccc.edu/ss and github.com/sh-maxim/ss
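Two quantities in the abstract above lend themselves to a small worked example: the 3-label accuracy (helix, sheet, coil) and the input window of ±14 amino acids. The sketch below is a minimal illustration, not the SecNet code; the function names, the padding character `X`, and the toy sequences are all assumptions made here.

```python
def q3_accuracy(true_labels, pred_labels):
    """Percentage of residues whose predicted label matches the true label."""
    if len(true_labels) != len(pred_labels) or not true_labels:
        raise ValueError("label strings must be non-empty and of equal length")
    matches = sum(t == p for t, p in zip(true_labels, pred_labels))
    return 100.0 * matches / len(true_labels)


def residue_windows(sequence, half_width=14, pad="X"):
    """Return one window of 2 * half_width + 1 residues per sequence position.

    Positions near the termini are padded with `pad` (an assumed convention)
    so that every residue sits at the center of a full-length window.
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


if __name__ == "__main__":
    # Hypothetical labels: H = helix, E = sheet, C = coil.
    print(q3_accuracy("HHHEECCC", "HHHECCCC"))
    print(residue_windows("ACDE", half_width=1))
```

With the default `half_width=14`, each window spans 29 residues, matching the ±14 window described in the abstract.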

    BioAssemblyModeler (BAM): user-friendly homology modeling of protein homo- and heterooligomers.

    Many if not most proteins function in oligomeric assemblies of one or more protein sequences. The Protein Data Bank provides coordinates for biological assemblies for each entry, at least 60% of which are dimers or larger assemblies. BioAssemblyModeler (BAM) is a graphical user interface to the basic steps in homology modeling of protein homooligomers and heterooligomers from the biological assemblies provided in the PDB. BAM takes as input up to six different protein sequences and begins by assigning Pfam domains to the target sequences. The program utilizes a complete assignment of Pfam domains to sequences in the PDB, PDBfam (http://dunbrack2.fccc.edu/protcid/pdbfam), to obtain templates that contain any or all of the domains assigned to the target sequence(s). The contents of the biological assemblies of potential templates are provided, and alignments of the target sequences to the templates are produced with a profile-profile alignment algorithm. BAM provides for visual examination and mouse-editing of the alignments supported by target and template secondary structure information and a 3D viewer of the template biological assembly. Side-chain coordinates for a model of the biological assembly are built with the program SCWRL4. A built-in protocol navigation system guides the user through all stages of homology modeling from input sequences to a three-dimensional model of the target complex. http://dunbrack.fccc.edu/BAM

    Minimum sequence identity within Pfam domain families identified in PDB structures with PDBfam.

    Kernel density estimates of the minimum sequence identities are shown for a total of 3730 Pfams. The sequence identities were determined by alignment of pairs of PDB sequences to the same HMM (curve labeled HMM) or by structure alignment with the program FATCAT.
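For readers unfamiliar with kernel density estimates, the sketch below shows the basic idea with a Gaussian kernel: each observed value contributes a small bell curve, and the estimate is their normalized sum. The kernel, the bandwidth, and the sample values are assumptions chosen for illustration; the caption does not say which estimator or settings were used.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at a point x."""
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, then normalize.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


if __name__ == "__main__":
    # Hypothetical minimum sequence identities (percent), one per family.
    min_identities = [12.0, 18.5, 22.0, 25.0, 31.0, 40.0, 55.0]
    density = gaussian_kde(min_identities, bandwidth=5.0)
    for x in range(0, 101, 20):
        print(x, round(density(float(x)), 5))
```

A smaller bandwidth produces a spikier curve that follows individual values, while a larger one smooths them into broader peaks.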

    Minimum sequence identities in ProtCID clusters of common interfaces between identical Pfams (A) or different Pfams (B).

    The minimum sequence identity for each pair of domains in the same interface cluster was determined using both FATCAT and transitive alignment via the Pfam HMM(s). For Pfam pairs, the minimum of the two Pfams was used in the density estimate. Same-Pfam clusters may contain homodimers and/or heterodimers that belong to the same Pfam domain family.